Ford GoBike is the Bay Area's bike share system. Bay Area Bike Share was introduced in 2013 as a pilot program for the region, with 700 bikes and 70 stations across San Francisco and San Jose.
Ford GoBike, like other bike share systems, consists of a fleet of specially designed, sturdy and durable bikes that are locked into a network of docking stations throughout the city. The bikes can be unlocked from one station and returned to any other station in the system, making them ideal for one-way trips. People use bike share to commute to work or school, run errands, get to appointments or social engagements and more. It's a fun, convenient and affordable way to get around.
The bikes are available for use 24 hours/day, 7 days/week, 365 days/year and riders have access to all bikes in the network when they become a member or purchase a pass.
Name: dataset.csv
Source: https://www.lyft.com/bikes/bay-wheels/system-data
The original link provided in the project (https://www.fordgobike.com/system-data) points to the above link.
File versions: 01/2018 - 04/2019
Data files for the remaining months of 2019 are also available, but they are not used here. They differ in file naming and contain additional fields, so combining them with the earlier files would require substantial modification.
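As a rough illustration of how monthly files with an identical layout could be stacked into a single table such as dataset.csv, here is a minimal sketch. The helper name `combine_monthly_files` and the in-memory sample contents are hypothetical stand-ins, and this is not necessarily how the dataset used here was assembled.

```python
import io

import pandas as pd


def combine_monthly_files(sources):
    """Read each monthly CSV and stack the rows into one DataFrame."""
    frames = [pd.read_csv(src) for src in sources]
    return pd.concat(frames, ignore_index=True)


# Two tiny in-memory stand-ins for monthly files (hypothetical contents).
jan = io.StringIO("duration_sec,user_type\n600,Subscriber\n1200,Customer\n")
feb = io.StringIO("duration_sec,user_type\n300,Subscriber\n")

combined = combine_monthly_files([jan, feb])
print(combined.shape)
```

Files whose columns differ would still concatenate, but the missing fields would be filled with NaN, which is one reason the later 2019 files need extra handling.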
Source: kepler.gl
Let's take a closer look at the three areas, San Francisco, East Bay and San José:
San Francisco and East Bay
San José
Let's find out. First we will look at the trip durations in each area.
fig, axes = plt.subplots(figsize=(12, 5), dpi=110)
n = 1
# Draw one histogram per area on the same axes
for i, x in enumerate(["San Francisco", "East Bay", "San José"]):
    df_new = df.query(f"label_name == '{x}'")
    bin_size = 100
    bins = np.arange(0, df_new.duration_sec.max()+bin_size, bin_size)
    plt.hist(df_new.duration_sec, bins=bins, label=x,
             color=sns.color_palette("viridis")[n], edgecolor="black", lw=0.4)
    n += 2  # skip every other palette colour for better contrast
plt.xticks(ticks=[x for x in range(0, 7000, 250)])
plt.legend()
plt.xlim(-100, 3500)
plt.title("Frequency of trip durations per area in seconds")
plt.xlabel("Seconds")
plt.tight_layout()
# Hide the y-axis; only the shape of the distributions matters here
cur_axes = plt.gca()
cur_axes.axes.get_yaxis().set_visible(False)
sns.despine(fig, left=True)
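To put numbers next to the histogram, the mean and median duration per area can be computed with a group-by. The following is a minimal sketch on made-up values; only the column names `label_name` and `duration_sec` are taken from the code above, and the figures are not results from the real dataset.

```python
import pandas as pd

# Made-up sample rows using the same column names as the plots above.
trips = pd.DataFrame({
    "label_name": ["San Francisco", "San Francisco",
                   "East Bay", "East Bay", "San José"],
    "duration_sec": [600, 1000, 500, 700, 400],
})

# Mean and median trip duration for each area.
summary = trips.groupby("label_name")["duration_sec"].agg(["mean", "median"])
print(summary)
```

Because trip durations are strongly right-skewed, the median is often the more robust of the two figures.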
There are far more Subscribers than Customers. This suggests that many people use the service regularly, for example to commute to work or school.
value_ct = df.user_type.value_counts().iloc[:31]
fig, ax = plt.subplots(figsize=(12, 5), dpi=110)
sns.countplot(x="user_type", data=df, order=value_ct.index,
              lw=0.5, edgecolor="black")
cur_axes = plt.gca()
cur_axes.axes.get_yaxis().set_visible(False)
sns.despine(fig, left=True)
# Label each bar with its share of all trips
# (value_ct.sum() replaces the hard-coded total 1906966+320033)
for p in ax.patches:
    ax.annotate('{:10.0f}%'.format(p.get_height()/value_ct.sum()
                                   * 100), (p.get_x()+0.31, p.get_height()+40000))
plt.title("Users By Type")
plt.xlabel("")
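The percentages shown on the bars can also be obtained directly, without plotting, via `value_counts(normalize=True)`. This is a small sketch on a made-up series, not the real counts.

```python
import pandas as pd

# Made-up user types: three Subscribers and one Customer.
user_type = pd.Series(["Subscriber", "Subscriber", "Subscriber", "Customer"])

# normalize=True returns fractions instead of counts; scale to percent.
share = user_type.value_counts(normalize=True) * 100
print(share.round(1))
```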
The next plots will focus on time components of our data.
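The columns `dayofweek`, `start_hr` and `month_year` used in the following plots are not created in this section. One way such columns could be derived is sketched below, assuming a timestamp column named `start_time` (an assumption, as are the sample values).

```python
import pandas as pd

# Made-up trips; the column name start_time is an assumption.
trips = pd.DataFrame({
    "start_time": ["2018-01-01 08:15:00", "2018-01-06 17:40:00"],
})
trips["start_time"] = pd.to_datetime(trips["start_time"])

trips["start_hr"] = trips["start_time"].dt.hour           # 0-23
trips["dayofweek"] = trips["start_time"].dt.dayofweek     # Monday=0, Sunday=6
trips["month_year"] = trips["start_time"].dt.to_period("M")
print(trips)
```

With Monday encoded as 0, the tick labels "Mon" to "Sun" used in the plots line up with the sorted `dayofweek` values.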
It looks like the users use the bikes more frequently during the week than during the weekend.
fig, ax = plt.subplots(figsize=(12, 4), dpi=110)
sns.countplot(x="dayofweek", data=df, lw=0.5, edgecolor="black")
plt.tight_layout()
cur_axes = plt.gca()
cur_axes.axes.get_yaxis().set_visible(False)
sns.despine(fig, left=True)
plt.title("Relative frequency of trips per day")
ax.set(xticklabels=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"])
plt.xlabel("")
plt.ylim(0, 500000)
# Label each bar with its share of all trips
for p in ax.patches:
    ax.annotate('{:10.0f}%'.format(p.get_height()/len(df)*100),
                (p.get_x()+0.1, p.get_height()+20000))
The most frequent starting hours are 08:00 and 17:00. People probably use the bikes before and after work, which would make sense because the dataset contains many subscribers of working age. You only subscribe to a service if you want to use it regularly, so integrating it into a work or study routine is plausible.
fig, ax = plt.subplots(figsize=(11, 5), dpi=110)
sns.countplot(x="start_hr", data=df, ax=ax, lw=0.5, edgecolor="black")
plt.tight_layout()
cur_axes = plt.gca()
cur_axes.axes.get_yaxis().set_visible(False)
sns.despine(fig, left=True)
plt.title("Relative frequency of trips per starting hour")
plt.xlabel("Starting hour")
plt.ylim(0, 400000)
# Label each bar with its share of all trips
for p in ax.patches:
    ax.annotate('{:10.1f}%'.format(p.get_height()/len(df)*100),
                (p.get_x()-0.8, p.get_height()+15000))
# Add the label for the first bar separately with ax.text
ax.text(0-1.15, ax.patches[0].get_height()+13000,
        '{:10.1f}%'.format(ax.patches[0].get_height()/len(df)*100))
The frequency of bike usage at the weekend is lower, but the average duration of each trip is greater than during the week!
fig, ax = plt.subplots(figsize=(12, 7), dpi=110)
# One mean duration per (weekday, month); select the column before
# averaging so that non-numeric columns do not break .mean()
sns.boxplot(x="dayofweek", y="duration_sec", data=df.groupby(
    ["dayofweek", "month_year"], as_index=False)["duration_sec"].mean())
plt.tight_layout()
sns.despine(fig, left=True)
plt.xlabel("")
plt.ylabel("Duration in seconds")
plt.title("Average trip duration per day")
ax.set(xticklabels=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]);
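The weekday versus weekend comparison can also be checked numerically. The sketch below uses made-up durations and assumes that `dayofweek` encodes Monday as 0 and Sunday as 6; the `day_type` column is a hypothetical helper, not part of the original data.

```python
import pandas as pd

# Made-up trips: two on weekdays, two at the weekend.
trips = pd.DataFrame({
    "dayofweek": [0, 1, 5, 6],
    "duration_sec": [600, 800, 1200, 1400],
})

# Saturday (5) and Sunday (6) count as weekend.
trips["day_type"] = trips["dayofweek"].ge(5).map(
    {True: "weekend", False: "weekday"})

means = trips.groupby("day_type")["duration_sec"].mean()
print(means)
```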
This trend holds for all areas. We can also see that users in San Francisco have, on average, the longest trips, followed by East Bay and then San José.
fig, ax = plt.subplots(figsize=(12, 7), dpi=110)
# One mean duration per (weekday, month, area)
sns.boxplot(x="dayofweek", y="duration_sec", data=df.groupby(
    ["dayofweek", "month_year", "label_name"], as_index=False)["duration_sec"].mean(),
    hue="label_name")
plt.tight_layout()
sns.despine(fig, left=True)
plt.xlabel("")
plt.ylabel("Duration in seconds")
plt.title("Average trip duration per day per area")
ax.set(xticklabels=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"])
# Shrink the axes to 80% width
box = ax.get_position()
ax.set_position([box.x0, box.y0, box.width * 0.8, box.height])
fig, ax = plt.subplots(figsize=(12, 7), dpi=110)
# One mean starting hour per (weekday, month)
sns.boxplot(x="dayofweek", y="start_hr", data=df.groupby(
    ["dayofweek", "month_year"], as_index=False)["start_hr"].mean())
plt.tight_layout()
sns.despine(fig, left=True)
plt.xlabel("")
plt.ylabel("Starting hour")
plt.title("Average starting hour per day")
ax.set(xticklabels=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]);
Looking at each area is interesting: users in East Bay and San José not only have shorter trips on average, they also start their trips later than users in San Francisco.
fig, ax = plt.subplots(figsize=(12, 7), dpi=110)
# One mean starting hour per (weekday, month, area)
sns.boxplot(x="dayofweek", y="start_hr", data=df.groupby(
    ["dayofweek", "month_year", "label_name"], as_index=False)["start_hr"].mean(),
    hue="label_name")
plt.tight_layout()
sns.despine(fig, left=True)
plt.xlabel("")
plt.ylabel("Starting hour")
plt.title("Average starting hour per day per area")
ax.set(xticklabels=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]);
First we will look at San Francisco.
We can see that most of the trips are close to the beach.
Now for East Bay